Accounting for hidden common causes when inferring cause and effect from observational data

Author

  • David Heckerman
Abstract

Identifying causal relationships from observational data is difficult, in large part due to the presence of hidden common causes. In some cases, where just the right patterns of conditional independence and dependence lie in the data—for example, Y-structures—it is possible to identify cause and effect. In other cases, the analyst deliberately makes an uncertain assumption that hidden common causes are absent, and infers putative causal relationships to be tested in a randomized trial. Here, we consider a third approach, where there are sufficient clues in the data such that hidden common causes can be inferred.

Example and basic results

We illustrate the approach with an example from genomics. We consider the task of a genome-wide association study (GWAS), wherein one tries to identify which genetic markers, known as single nucleotide polymorphisms (SNPs), causally influence some trait of interest (e.g., height). Figure 1a shows a generative model for the task. In many cases, the relationship between the causal SNPs and the trait is well represented by multiple linear regression (unlike the special cases of dominance and recessiveness that we learn about in high-school biology). The hidden common causes of the SNPs (here represented by a single hidden node) often correspond to family relatedness (close or distant) among the individuals in the cohort. A million or more SNPs can be measured, but only a relatively small fraction of them causally influence the trait. The goal of causal inference is to identify the SNPs that do.

If there were no hidden common causes of the SNPs, one could distinguish causal from non-causal SNPs by applying univariate linear regression to assess the correlation between a SNP and the trait, producing a P value based on, for example, a likelihood ratio test. The separation of causal and non-causal SNPs won't be perfect, as some non-causal SNPs will have small P values by chance. Nonetheless, the distribution of P values among the non-causal SNPs should be uniform (we say the P values are calibrated), whereas the distribution of P values among the causal SNPs will be highly skewed to small values, allowing for a separation of causal from non-causal SNPs that is often useful in practice.

When family relatedness is present, univariate linear regression fails because non-causal SNPs are correlated with the trait. As seen in Figure 1a, there are d-connecting paths between each non-causal SNP and the trait through the hidden variable. These so-called spurious associations clutter the results, leading researchers on expensive and time-consuming wild goose chases. To address this problem, one could perform multiple linear regression conditioning on all causal SNPs. Unfortunately, we don't know which SNPs are causal. Consequently, an approach now commonly used in the genomics community is to condition on all SNPs except for the one being tested for association. As there can be millions of SNPs in an analysis, L2 regularization is used to attenuate variance. Experiments with synthetic data (to be described in more detail) show that this approach of conditioning on all SNPs yields calibrated P values across many GWASs with a wide range of realistic …

Presented at the NIPS workshop on causal inference (NIPS 2017), Long Beach, CA, USA.

arXiv:1801.00727v1 [cs.AI] 2 Jan 2018
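The abstract describes two testing strategies: univariate linear regression with a likelihood-ratio P value, and conditioning on all SNPs except the one being tested, with L2 regularization. Below is a minimal, self-contained sketch of both on synthetic data. It is not the paper's implementation: the single hidden "relatedness" variable that shifts allele frequencies, the untuned ridge penalty `lam`, and the residualization-based version of the conditioning test are illustrative assumptions chosen for brevity. In real analyses with millions of SNPs, the conditioning is typically carried out with efficient formulations (e.g., linear mixed models, which are closely related to ridge regression on all SNPs) rather than a per-SNP refit.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# ---- Synthetic GWAS with a hidden common cause (stand-in for family relatedness) ----
n_individuals, n_snps, n_causal = 600, 200, 10
hidden = rng.normal(size=n_individuals)          # hidden confounder shared by all SNPs

# Genotypes (0/1/2 minor-allele counts) whose allele frequencies shift with the hidden
# variable, so every SNP is correlated with it (and hence, spuriously, with the trait).
maf = rng.uniform(0.1, 0.5, size=n_snps)
logits = np.log(maf / (1 - maf))[None, :] + 0.8 * hidden[:, None]
snps = rng.binomial(2, 1 / (1 + np.exp(-logits))).astype(float)

causal_idx = rng.choice(n_snps, n_causal, replace=False)
beta = np.zeros(n_snps)
beta[causal_idx] = rng.normal(0.0, 0.5, size=n_causal)

# Trait: linear in the causal SNPs plus a direct effect of the hidden cause plus noise.
trait = snps @ beta + hidden + rng.normal(size=n_individuals)

def lr_pvalue(y, x):
    """Likelihood-ratio P value for adding one predictor x to an intercept-only model."""
    n = len(y)
    rss0 = np.sum((y - y.mean()) ** 2)
    X = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss1 = np.sum((y - X @ coef) ** 2)
    lr = n * np.log(rss0 / rss1)                 # asymptotically chi^2 with 1 df
    return stats.chi2.sf(lr, df=1)

def ridge_residual(v, X, lam=10.0):
    """Residual of v after ridge regression on X (centered; lam is an untuned choice)."""
    Xc, vc = X - X.mean(axis=0), v - v.mean()
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ vc)
    return vc - Xc @ w

# (1) Univariate regression: P values for non-causal SNPs are inflated, because each
#     SNP is d-connected to the trait through the hidden variable.
p_uni = np.array([lr_pvalue(trait, snps[:, j]) for j in range(n_snps)])

# (2) Condition on all SNPs except the one being tested, with L2 regularization:
#     residualize both the trait and the tested SNP on the remaining SNPs, then apply
#     the same likelihood-ratio test to the residuals (an approximation, since the
#     chi^2 reference ignores the degrees of freedom spent on the conditioning set).
p_cond = np.empty(n_snps)
for j in range(n_snps):
    others = np.delete(snps, j, axis=1)
    p_cond[j] = lr_pvalue(ridge_residual(trait, others),
                          ridge_residual(snps[:, j], others))

non_causal = np.setdiff1d(np.arange(n_snps), causal_idx)
print("fraction of non-causal SNPs with P < 0.05 (calibrated tests give ~0.05):")
print(f"  univariate:                 {np.mean(p_uni[non_causal] < 0.05):.3f}")
print(f"  conditioned on other SNPs:  {np.mean(p_cond[non_causal] < 0.05):.3f}")
```

Running the sketch prints the false-positive fraction among non-causal SNPs for each test; under this confounded simulation the univariate fraction is typically well above the nominal 0.05, while the conditioned version is typically much closer to it.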


Similar resources

Inferring Hidden Causes

One of the important aspects of human causal reasoning is that from the time we are young children we reason about unobserved causes. How can we learn about unobserved causes from information about observed events? Causal Bayes nets provide a formal account of how causal structure is learned from a combination of associations and interventions. This formalism makes specific predictions about th...


Species interactions in Markov networks

Inferring species interactions from observational data is one of the most controversial tasks in community ecology. One difficulty is that a single pairwise interaction can ripple through an ecological network and produce surprising indirect consequences. For example, two competing species would ordinarily correlate negatively in space, but this effect can be reversed in the presence ...


Distinguishing between cause and effect

We describe eight data sets that together formed the CauseEffectPairs task in the Causality Challenge #2: Pot-Luck competition. Each set consists of a sample of a pair of statistically dependent random variables. One variable is known to cause the other one, but this information was hidden from the participants; the task was to identify which of the two variables was the cause and which one the...


Bayesian Algorithms for Causal Data Mining

We present two Bayesian algorithms CD-B and CD-H for discovering unconfounded cause and effect relationships from observational data without assuming causal sufficiency which precludes hidden common causes for the observed variables. The CD-B algorithm first estimates the Markov blanket of a node X using a Bayesian greedy search method and then applies Bayesian scoring methods to discriminate t...


Seeing the Unobservable – Inferring the Probability and Impact of Hidden Causes

The causal impact of an observable cause can only be estimated if assumptions are made about the presence and impact of possible additional unobservable causes. Current theories of causal reasoning make different assumptions about hidden causes. Some views assume that hidden causes are always present, others that they are independent of the observed causes. In two experiments we assessed people...



Journal:
  • CoRR

Volume: abs/1801.00727

Publication year: 2018